Abstract


This paper describes an alignment-based model for interpreting natural language instructions in context. We approach instruction following as a search over plans, scoring sequences of actions conditioned on structured observations of text and the environment. By explicitly modeling both the low-level compositional structure of individual actions and the high-level structure of full plans, we are able to learn both grounded representations of sentence meaning and pragmatic constraints on interpretation. To demonstrate the model’s flexibility, we apply it to a diverse set of benchmark tasks. On every task, we outperform strong task-specific baselines, and achieve several new state-of-the-art results.


1 Introduction 


In instruction-following tasks, an agent executes a sequence of actions in a real or simulated environment, in response to a sequence of natural language commands. Examples include giving navigational directions to robots and providing hints to automated game-playing agents. Plans specified with natural language exhibit compositionality both at the level of individual actions and at the overall sequence level. This paper describes a framework for learning to follow instructions by leveraging structure at both levels.

Our primary contribution is a new, alignment-based approach to grounded compositional semantics. Building on related logical approaches (Reddy et al., 2014; Pourdamghani et al., 2014), we recast instruction following as a pair of nested, structured alignment problems. Given instructions and a candidate plan, the model infers a sequence-to-sequence alignment between sentences and atomic actions. Within each sentence-action pair, the model infers a structure-to-structure alignment between the syntax of the sentence and a graph-based representation of the action.

At a high level, our agent is a block-structured, graph-valued conditional random field, with alignment potentials to relate instructions to actions and transition potentials to encode the environment model (Figure 3). Explicitly modeling sequence-to-sequence alignments between text and actions allows flexible reasoning about action sequences, enabling the agent to determine which actions are specified (perhaps redundantly) by text, and which actions must be performed automatically (in order to satisfy pragmatic constraints on interpretation). Treating instruction following as a sequence prediction problem, rather than a series of independent decisions (Branavan et al., 2009; Artzi and Zettlemoyer, 2013), makes it possible to use general-purpose planning machinery, greatly increasing inferential power.

The fragment of semantics necessary to complete most instruction-following tasks is essentially predicate-argument structure, with limited influence from quantification and scoping. Thus the problem of sentence interpretation can reasonably be modeled as one of finding an alignment between language and the environment it describes. We allow this structure-to-structure alignment (an “overlay” of language onto the world) to be mediated by linguistic structure (in the form of dependency parses) and structured perception (in what we term grounding graphs). Our model thereby reasons directly about the relationship between language and observations of the environment, without the need for an intermediate logical representation of sentence meaning. This, in turn, makes it possible to incorporate flexible feature representations that have been difficult to integrate with previous work in semantic parsing.

(a) Map reading: “... right round the white water but stay quite close ’cause you don’t otherwise you’re going to be in that stone creek ...”

(b) Maze navigation: “Go down the yellow hall. Turn left at the intersection of the yellow and the gray.”

(c) Puzzle solving: “Clear the right column. Then the other column. Then the row.”

Figure 1: Example tasks handled by our framework. The tasks feature noisy text, over- and under-specification of plans, and challenging search problems.

We apply our approach to three established instruction-following benchmarks: the map reading task of Vogel and Jurafsky (2010), the maze navigation task of MacMahon et al. (2006), and the puzzle solving task of Branavan et al. (2009). An example from each is shown in Figure 1. These benchmarks exhibit a range of qualitative properties, both in the length and complexity of their plans, and in the quantity and quality of accompanying language. Each task has been studied in isolation, but we are unaware of any published approaches capable of robustly handling all three. Our general model outperforms strong, task-specific baselines in each case, achieving relative error reductions of 15-20% over several state-of-the-art results. Experiments demonstrate the importance of our contributions in both compositional semantics and search over plans. We have released all code for this project at github.com/jacobandreas/instructions.


2 Related work 

Existing work on instruction following can be 
roughly divided into two families: semantic 
parsers and linear policy estimators. 


Semantic parsers. Parser-based approaches (Chen and Mooney, 2011; Artzi and Zettlemoyer, 2013; Kim and Mooney, 2013; Tellex et al., 2011) map from text into a formal language representing commands. These take familiar structured prediction models for semantic parsing (Zettlemoyer and Collins, 2005; Wong and Mooney, 2006), and train them with task-provided supervision. Instead of attempting to match the structure of a manually-annotated semantic parse, semantic parsers for instruction following are trained to maximize a reward signal provided by black-box execution of the predicted command in the environment. (It is possible to think of response-based learning for question answering (Liang et al., 2013) as a special case.)

This approach uses a well-studied mechanism for compositional interpretation of language, but is subject to certain limitations. Because the environment is manipulated only through black-box execution of the completed semantic parse, there is no way to incorporate current or future environment state into the scoring function. It is also in general necessary to hand-engineer a task-specific formal language for describing agent behavior. Thus it is extremely difficult to work with environments that cannot be modeled with a fixed inventory of predicates (e.g. those involving novel strings or arbitrary real quantities).

Much contemporary work in this family is evaluated on the maze navigation task introduced by MacMahon et al. (2006). Dukes (2013) also introduced a “blocks world” task for situated parsing of spatial robot commands.


Linear policy estimators. An alternative family of approaches is based on learning a policy over primitive actions directly (Branavan et al., 2009; Vogel and Jurafsky, 2010).¹ Policy-based approaches instantiate a Markov decision process representing the action domain, and apply standard supervised or reinforcement-learning approaches to learn a function for greedily selecting among actions. In linear policy approximators, natural language instructions are incorporated directly into state observations, and reading order becomes part of the action selection process.

¹This is distinct from semantic parsers in which greedy inference happens to have an interpretation as a policy (Vlachos and Clark, 2014).

Almost all existing policy-learning approaches make use of an unstructured parameterization, with a single (flat) feature vector representing all text and observations. Such approaches are thus restricted to problems that are simple enough (and have small enough action spaces) to be effectively characterized in this fashion. While there is a great deal of flexibility in the choice of feature function (which is free to inspect the current and future state of the environment, the whole instruction sequence, etc.), standard linear policy estimators have no way to model compositionality in language or actions.

Agents in this family have been evaluated on a variety of tasks, including map reading (Anderson et al., 1991) and gameplay (Branavan et al., 2009).


Though both families address the same class of instruction-following problems, they have been applied to totally disjoint sets of tasks. It should be emphasized that there is nothing inherent to policy learning that prevents the use of compositional structure, and nothing inherent to general compositional models that prevents more complicated dependence on environment state. Indeed, previous work (Branavan et al., 2011; Narasimhan et al., 2015) uses aspects of both to solve a different class of gameplay problems. In some sense, our goal in this paper is simply to combine the strengths of semantic parsers and linear policy estimators for fully general instruction following. As we shall see, however, this requires changes to many aspects of representation, learning and inference.


3 Representations 

We wish to train a model capable of following commands in a simulated environment. We do so by presenting the model with a sequence of training pairs $(x, y)$, where each $x$ is a sequence of natural language instructions $(x_1, x_2, \ldots, x_m)$, e.g.:

(Go down the yellow hall., Turn left., ...)

and each $y$ is a demonstrated action sequence $(y_1, y_2, \ldots, y_n)$, e.g.:

(rotate(90), move(2), ...)

Given a start state, $y$ can equivalently be characterized by a sequence of (state, action, state) triples resulting from execution of the environment model. An example instruction is shown in Figure 2a. An example action, situated in the environment where it occurs, is shown in Figure 2e.

[Figure 2 panels: (a) Text: “Go down the yellow hall”; (b) Syntax: dependency parse of “go down the yellow hall”; (c) Alignment; (d) Perception: grounding graph; (e) Environment: move(2)]

Figure 2: Structure-to-structure alignment connecting a single sentence (via its syntactic analysis) to the environment state (via its grounding graph). The connecting alignments take the place of a traditional semantic parse and allow flexible, feature-driven linking between lexical primitives and perceptual factors.

Our model performs compositional interpretation of instructions by leveraging existing structure inherent in both text and actions. Thus we interpret $x_i$ and $y_j$ not as raw strings and primitive actions, but rather as structured objects.
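The equivalence between an action sequence and a sequence of (state, action, state) triples can be illustrated with a minimal sketch. The grid state, the `toy_transition` function and the axis convention below are assumptions made for illustration only; the environment models used in the experiments are task-specific and are not reproduced here.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int, int]  # (x, y, heading in degrees): a toy state, assumed here
Action = Tuple[str, int]      # e.g. ("rotate", 90) or ("move", 2)
Triple = Tuple[State, Action, State]

# Unit steps for each heading; the axis convention is an arbitrary choice.
STEP = {0: (0, 1), 90: (1, 0), 180: (0, -1), 270: (-1, 0)}


def toy_transition(state: State, action: Action) -> State:
    """Hypothetical deterministic environment model for a grid maze."""
    x, y, heading = state
    name, arg = action
    if name == "rotate":
        return (x, y, (heading + arg) % 360)
    if name == "move":
        dx, dy = STEP[heading]
        return (x + dx * arg, y + dy * arg, heading)
    raise ValueError(f"unknown action: {name!r}")


def to_triples(start: State, actions: List[Action],
               transition: Callable[[State, Action], State]) -> List[Triple]:
    """Roll the environment model forward, recording one triple per action."""
    triples = []
    state = start
    for action in actions:
        next_state = transition(state, action)
        triples.append((state, action, next_state))
        state = next_state
    return triples


triples = to_triples((0, 0, 0), [("rotate", 90), ("move", 2)], toy_transition)
```

Each triple is the unit that would later be summarized by a grounding graph, so the rollout is the natural place to attach perceptual features.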

Linguistic structure. We assume access to a pretrained parser, and in particular that each of the instructions $x_i$ is represented by a tree-structured dependency parse. An example is shown in Figure 2b.


Action structure. By analogy to the representation of instructions as parse trees, we assume that each (state, action, state) triple (provided by the environment model) can be characterized by a grounding graph.² The structure and content of this representation is task-specific. An example grounding graph for the maze navigation task is shown in Figure 2d. The example contains a node corresponding to the primitive action move(2) (in the upper left), and several nodes corresponding to locations in the environment that are visible after the action is performed.

Each node in the graph (and, though not depicted, each edge) is decorated with a list of features. These features might be simple indicators (e.g. whether the primitive action performed was move or rotate), real values (the distance traveled) or even string-valued (English-language names of visible landmarks, if available in the environment description). Formally, a grounding graph consists of a tuple $(V, E, \mathcal{L}, f_V, f_E)$, with

- $V$ a set of vertices

- $E \subseteq V \times V$ a set of (directed) edges

- $\mathcal{L}$ a space of labels (numbers, strings, etc.)

- $f_V : V \to 2^{\mathcal{L}}$ a vertex feature function

- $f_E : E \to 2^{\mathcal{L}}$ an edge feature function

In this paper we have tried to remain agnostic to details of graph construction. Our goal with the grounding graph framework is simply to accommodate a wider range of modeling decisions than allowed by existing formalisms. Graphs might be constructed directly, given access to a structured virtual environment (as in all experiments in this paper), or alternatively from outputs of a perceptual system. For our experiments, we have remained as close as possible to task representations described in the existing literature. Details for each task can be found in the accompanying software package.

Graph-based representations are extremely common in formal semantics (Jones et al., 2012; Reddy et al., 2014), and the version presented here corresponds to a simple generalization of familiar formal methods. Indeed, if $\mathcal{L}$ is the set of all atomic entities and relations, $f_V$ returns a unique label for every $v \in V$, and $f_E$ always returns a vector with one active feature, we recover the existentially-quantified portion of first order logic exactly, and in this form can implement large parts of classical neo-Davidsonian semantics (Parsons, 1990) using grounding graphs.

²We note that the instruction following model of Tellex et al. (2011) features a similarly named “Generalized Grounding Graph” (G³) formalism. A G³ links the syntax of the input command to the action ultimately executed, and is thus more analogous to our structured alignment variable (Figure 2c) than our perceptual representation.


Crucially, with an appropriate choice of $\mathcal{L}$ this formalism also makes it possible to go beyond set-theoretic relations, and incorporate string-valued features (like names of entities and landmarks) and real-valued features (like colors and positions) as well.
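As a rough illustration of the grounding graph definition, the following sketch stores vertices, directed edges and the two feature functions in a small container. The class name, method names and the example labels (such as "ends_in") are hypothetical and are not taken from the released software; labels are modeled as arbitrary hashable values so that indicator, real-valued and string-valued features can coexist.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Hashable, List, Set, Tuple

Label = Hashable  # the label space: numbers, strings, tuples, ...


@dataclass
class GroundingGraph:
    """Illustrative container for vertices, edges and their feature sets."""
    vertices: Set[str] = field(default_factory=set)
    edges: Set[Tuple[str, str]] = field(default_factory=set)
    vertex_features: Dict[str, FrozenSet[Label]] = field(default_factory=dict)
    edge_features: Dict[Tuple[str, str], FrozenSet[Label]] = field(default_factory=dict)

    def add_vertex(self, v: str, *labels: Label) -> None:
        self.vertices.add(v)
        self.vertex_features[v] = frozenset(labels)

    def add_edge(self, u: str, v: str, *labels: Label) -> None:
        if u not in self.vertices or v not in self.vertices:
            raise KeyError("both endpoints must already be vertices")
        self.edges.add((u, v))
        self.edge_features[(u, v)] = frozenset(labels)

    def children(self, v: str) -> List[str]:
        """Targets of edges leaving v, in a stable order."""
        return sorted(w for (u, w) in self.edges if u == v)


# A toy graph for the action move(2): one action node and one visible location.
g = GroundingGraph()
g.add_vertex("action", ("type", "move"), ("distance", 2.0))
g.add_vertex("hall", ("kind", "hallway"), ("rgb", (0.5, 0.5, 0.0)), ("name", "yellow hall"))
g.add_edge("action", "hall", ("relation", "ends_in"))
```

Representing each feature as a (key, value) pair is one convenient choice among many; it keeps the feature sets hashable while leaving the value type unconstrained.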


Lexical semantics. We must eventually combine features provided by parse trees with features provided by the environment. Examples here might include simple conjunctions (word=yellow ∧ rgb=(0.5, 0.5, 0.0)) or more complicated computations like edit distance between landmark names and lexical items. Features of the latter kind make it possible to behave correctly in environments containing novel strings or other features unseen during training.

This aspect of the syntax-semantics interface has been troublesome for some logic-based approaches: while past work has used related machinery for selecting lexicon entries (Berant and Liang, 2014) or for rewriting logical forms (Kwiatkowski et al., 2013), the relationship between text and the environment has ultimately been mediated by a discrete (and indeed finite) inventory of predicates. Several recent papers have investigated simple grounded models with real-valued output spaces (Andreas and Klein, 2014; McMahan and Stone, 2015), but we are unaware of any fully compositional system in recent literature that can incorporate observations of these kinds.

Formally, we assume access to a joining feature function $\phi : (2^{\mathcal{L}} \times 2^{\mathcal{L}}) \to \mathbb{R}^d$. As with grounding graphs, our goal is to make the general framework as flexible as possible, and for individual experiments have chosen $\phi$ to emulate modeling decisions from previous work.
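A joining feature function of the kind described above can be sketched as follows. This is only an assumed illustration: the sparse dictionary output stands in for a real-valued feature vector, and the feature naming scheme and the choice to compute edit distance only between words and string-valued labels are ours, not the feature sets used in the experiments.

```python
from typing import Dict, FrozenSet, Hashable, Tuple

Feature = Tuple[str, Hashable]  # (key, value) pairs, e.g. ("word", "yellow")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def joining_features(word_feats: FrozenSet[Feature],
                     node_feats: FrozenSet[Feature]) -> Dict[str, float]:
    """Sparse joining features: conjunctions, plus edit distance for strings."""
    feats: Dict[str, float] = {}
    for wk, wv in word_feats:
        for nk, nv in node_feats:
            # Indicator for the conjunction of a lexical and a perceptual feature.
            feats[f"{wk}={wv}^{nk}={nv}"] = 1.0
            # Real-valued feature that still applies to strings unseen in training.
            if wk == "word" and isinstance(nv, str):
                feats[f"edit({wk},{nk})"] = float(edit_distance(str(wv), nv))
    return feats


feats = joining_features(
    frozenset({("word", "yellow")}),
    frozenset({("name", "yellow hall"), ("rgb", (0.5, 0.5, 0.0))}),
)
```

The conjunction features memorize pairings seen in training, while the edit-distance feature shares a single weight across all landmark names, which is what allows generalization to novel strings.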


4 Model 

As noted in the introduction, we approach instruction following as a sequence prediction problem. Thus we must place a distribution over sequences of actions conditioned on instructions. We decompose the problem into two components, describing interlocking models of “path structure” and “action structure”. Path structure captures how sequences of instructions give rise to sequences of actions, while action structure captures the compositional relationship between individual utterances and the actions they specify.































[Figure 3 layers: Text (“Go down the yellow hall.”, “Turn left.”), Alignments, Plans]

Figure 3: Our model is a conditional random field that describes distributions over state-action sequences conditioned on input text. Each variable’s domain is a structured value. Sentences align to a subset of the state-action sequences, with the rest of the states filled in by pragmatic (planning) implication. State-to-state structure represents planning constraints (environment model) while state-to-text structure represents compositional alignment. All potentials are log-linear and feature-driven.


Path structure: aligning utterances to actions

The high-level path structure in the model is depicted in Figure 3. Our goal here is to permit both under- and over-specification of plans, and to expose a planning framework which allows plans to be computed with lookahead (i.e. non-greedily).

These goals are achieved by introducing a sequence of latent alignments between instructions and actions. Consider the multi-step example in Figure 1b. If the first instruction “go down the yellow hall” were interpreted immediately, we would have a presupposition failure: the agent is facing a wall, and cannot move forward at all. Thus an implicit rotate action, unspecified by text, must be performed before any explicit instructions can be followed.

To model this, we take the probability of a (text, plan, alignment) triple to be log-proportional to the sum of two quantities:

1. a path-only score $\psi(n; \theta) + \sum_j \psi(y_j; \theta)$

2. a path-and-text score, itself the sum of all pair scores $\psi(x_i, y_j; \theta)$ licensed by the alignment

(1) captures our desire for pragmatic constraints on interpretation, and provides a means of encoding the inherent plausibility of paths. We take $\psi(n; \theta)$ and $\psi(y; \theta)$ to be linear functions of $\theta$. (2) provides context-dependent interpretation of text by means of the structured scoring function $\psi(x, y; \theta)$, described in the next section.

Formally, we associate with each instruction $x_i$ a sequence-to-sequence alignment variable $a_i \in 1 \ldots n$ (recalling that $n$ is the number of actions). Then we have³

$$p(y, a \mid x; \theta) \propto \exp\Big\{ \psi(n) + \sum_{j=1}^{n} \psi(y_j) + \sum_{i=1}^{m} \sum_{j=1}^{n} 1[a_i = j]\, \psi(x_i, y_j) \Big\} \qquad (1)$$


We additionally place a monotonicity constraint on the alignment variables. This model is globally normalized, and for a fixed alignment is equivalent to a linear-chain CRF. In this sense it is analogous to IBM Model I (Brown et al., 1993), with the structured potentials $\psi(x_i, y_j)$ taking the place of lexical translation probabilities. While alignment models from machine translation have previously been used to align words to fragments of semantic parses (Wong and Mooney, 2006; Pourdamghani et al., 2014), we are unaware of such models being used to align entire instruction sequences to demonstrations.
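The path-level score can be made concrete with a short sketch that computes the unnormalized log score of a (text, plan, alignment) triple as the sum of a path-only term and a path-and-text term. The function and argument names are hypothetical, indices are 0-based, and the three potentials are passed in as plain callables rather than the log-linear functions of the parameters used in the model.

```python
from typing import Callable, Sequence


def is_monotone(alignment: Sequence[int]) -> bool:
    """True if later instructions never align to earlier actions."""
    return all(a <= b for a, b in zip(alignment, alignment[1:]))


def log_score(instructions: Sequence[str],
              actions: Sequence[str],
              alignment: Sequence[int],
              psi_len: Callable[[int], float],
              psi_action: Callable[[str], float],
              psi_pair: Callable[[str, str], float]) -> float:
    """Unnormalized log score: path-only terms plus aligned pair terms."""
    if len(alignment) != len(instructions):
        raise ValueError("need exactly one alignment index per instruction")
    if not is_monotone(alignment):
        return float("-inf")  # excluded by the monotonicity constraint
    total = psi_len(len(actions))
    total += sum(psi_action(y) for y in actions)
    total += sum(psi_pair(x, actions[a]) for x, a in zip(instructions, alignment))
    return total


# Toy potentials: a mild length penalty and a lookup table of compatible pairs.
compatible = {("Go down the yellow hall.", "move(2)"): 1.0,
              ("Turn left.", "rotate(-90)"): 1.0}
instructions = ["Go down the yellow hall.", "Turn left."]
actions = ["rotate(90)", "move(2)", "rotate(-90)"]

score = log_score(instructions, actions, [1, 2],
                  psi_len=lambda n: -0.1 * n,
                  psi_action=lambda y: 0.0,
                  psi_pair=lambda x, y: compatible.get((x, y), 0.0))
bad = log_score(instructions, actions, [2, 1],
                psi_len=lambda n: -0.1 * n,
                psi_action=lambda y: 0.0,
                psi_pair=lambda x, y: compatible.get((x, y), 0.0))
```

In the toy example the first action is left unaligned, mirroring the implicit rotate that no instruction mentions; it contributes only through the path-only terms.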


Action structure: aligning words to percepts

Intuitively, this scoring function $\psi(x, y)$ should capture how well a given utterance describes an action. If neither the utterances nor the actions had structure (i.e. both could be represented with simple bags of features), we would recover something analogous to the conventional policy-learning approach. As structure is essential for some of our tasks, $\psi(x, y)$ must instead fill the role of a semantic parser in a conventional compositional model.

Our choice of $\psi(x, y)$ is driven by the following fundamental assumptions: Syntactic relations approximately represent semantic relations. Syntactic proximity implies relational proximity. In this view, there is an additional hidden structure-to-structure alignment between the grounding graph and the parsed text describing it. Words line up with nodes, and dependencies line up with relations. Visualizations are shown in Figure 2c and the zoomed-in portion of Figure 3.


As with the top-level alignment variables, this approach can be viewed as a simple relaxation of a familiar model. CCG-based parsers assume that syntactic type strictly determines semantic type, and that each lexical item is associated with a small set of functional forms. Here we simply allow all words to license all predicates, multiple words to specify the same predicate, and some edges to be skipped. We instead rely on a scoring function to impose soft versions of the hard constraints typically provided by a grammar. Related models have previously been used for question answering (Reddy et al., 2014; Pasupat and Liang, 2015).

³Here and in the remainder of this paper, we suppress the dependence of the various potentials on $\theta$ in the interest of readability.

⁴It is formally possible to regard the sequence-to-sequence and structure-to-structure alignments as a single (structured) random variable. However, the two kinds of alignments are treated differently for purposes of inference, so it is useful to maintain a notational distinction.

For the moment let us introduce variables $b$ to denote these structure-to-structure alignments. (As will be seen in the following section, it is straightforward to marginalize over all choices of $b$. Thus the structure-to-structure alignments are never explicitly instantiated during inference, and do not appear in the final form of $\psi(x, y)$.) For a fixed alignment, we define $\psi(x, y, b)$ according to a recurrence relation. Let $x^i$ be the $i$th word of the sentence, and let $y^j$ be the $j$th node in the action graph (under some topological ordering). Let $c(i)$ and $c(j)$ give the indices of the dependents of $x^i$ and children of $y^j$ respectively. Finally, let $x^{ik}$ and $y^{jl}$ denote the associated dependency type or relation. Define a “descendant” function:

$$d(i, j) = \{(k, l) : k \in c(i),\ l \in c(j),\ (k, l) \in b\}$$

Then,

$$\psi(x^i, y^j, b) = \exp\Big\{ \theta^\top \phi(x^i, y^j) + \sum_{(k,l) \in d(i,j)} \Big[ \theta^\top \phi(x^{ik}, y^{jl}) + \log \psi(x^k, y^l, b) \Big] \Big\} \qquad (2)$$

This is just an unnormalized synchronous derivation between $x$ and $y$: at any aligned (node, word) pair, the score for the entire derivation is the score produced by combining that word and node, times the scores at all the aligned descendants. Observe that as long as there are no cycles in the dependency parse, it is perfectly acceptable for the relation graph to contain cycles and even self-loops: the recurrence still bottoms out appropriately.
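The prose description of the recurrence (the score at an aligned pair times the scores at all aligned descendants) can be sketched in log space for a fixed alignment. The function below is an illustrative reading rather than the released implementation: `node_score` and `edge_score` are hypothetical callables standing in for the log-linear word-node and dependency-relation scores, and `b` is given explicitly as a set of (word, node) pairs.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Word = int
Node = Hashable


def log_psi(i: Word, j: Node,
            word_children: Dict[Word, List[Word]],
            node_children: Dict[Node, List[Node]],
            b: Set[Tuple[Word, Node]],
            node_score: Callable[[Word, Node], float],
            edge_score: Callable[[Word, Word, Node, Node], float]) -> float:
    """Log score of the derivation rooted at the aligned pair (word i, node j)."""
    total = node_score(i, j)
    for k in word_children.get(i, []):       # dependents of word i
        for l in node_children.get(j, []):   # children of node j
            if (k, l) in b:                  # the pair is an aligned descendant
                total += edge_score(i, k, j, l)
                total += log_psi(k, l, word_children, node_children, b,
                                 node_score, edge_score)
    return total


# Toy parse: go(0) -> hall(1) -> yellow(2). The graph has a self-loop on "h".
word_children = {0: [1], 1: [2]}
node_children = {"a": ["h"], "h": ["h"]}
b = {(1, "h"), (2, "h")}  # "hall" and "yellow" both align to node "h"

value = log_psi(0, "a", word_children, node_children, b,
                node_score=lambda i, j: 0.5,
                edge_score=lambda i, k, j, l: 0.25)
```

The recursion descends only along the dependency tree, so it terminates even though the toy grounding graph contains a self-loop, and two words are free to align to the same node.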


5 Learning and inference 

Given a sequence of training pairs $(x, y)$, we wish to find a parameter setting that maximizes $p(y \mid x; \theta)$. If there were no latent alignments $a$ or $b$, this would simply involve minimization of a convex objective. The presence of latent variables complicates things. Ideally, we would like to sum over the latent variables, but that sum is intractable. Instead we make a series of variational approximations: first we replace the sum with a maximization, then perform iterated conditional modes, alternating between maximization of the conditional probability of $a$ and $\theta$. We begin by initializing $\theta$ randomly.

As noted in the preceding section, the variable $b$ does not appear in these equations. Conditioned on $a$, the sum over structure-to-structure alignments, $\psi(x, y) = \sum_b \psi(x, y, b)$, can be performed exactly using a simple dynamic program which runs in time $O(|x||y|)$ (assuming out-degree bounded by a constant, and with $|x|$ and $|y|$ the number of words and graph nodes respectively). This is Algorithm 1.

Algorithm 1 Computing structure-to-structure alignments

  $x^i$ are words in reverse topological order
  $y^j$ are grounding graph nodes (root last)
  chart is an $m \times n$ array (here $m = |x|$ and $n = |y|$)
  for $i = 1$ to $|x|$ do
    for $j = 1$ to $|y|$ do
      score ← $\exp\{\theta^\top \phi(x^i, y^j)\}$
      for $(k, l) \in d(i, j)$ do
        s ← $\sum_{l \in c(j)} \exp\{\theta^\top \phi(x^{ik}, y^{jl})\} \cdot$ chart[k, l]
        score ← score · s
      end for
      chart[i, j] ← score
    end for
  end for
  return chart[m, n]
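The dynamic program can be sketched in Python under a few stated assumptions. We read the inner loop as ranging over the dependents of the current word, with each dependent summing over the children of the current node; we do not model skipped edges; and `node_score` and `edge_score` are hypothetical callables returning the log-linear scores. The chart is keyed by (word, node) pairs rather than array indices.

```python
import math
from typing import Callable, Dict, Hashable, List, Tuple

Word = int
Node = Hashable


def marginal_chart(word_order: List[Word],
                   word_children: Dict[Word, List[Word]],
                   nodes: List[Node],
                   node_children: Dict[Node, List[Node]],
                   node_score: Callable[[Word, Node], float],
                   edge_score: Callable[[Word, Word, Node, Node], float]
                   ) -> Dict[Tuple[Word, Node], float]:
    """chart[(i, j)]: total score of all alignments of the subtree at word i
    whose root is aligned to node j. word_order lists dependents before heads."""
    chart: Dict[Tuple[Word, Node], float] = {}
    for i in word_order:
        for j in nodes:
            score = math.exp(node_score(i, j))
            for k in word_children.get(i, []):
                # Sum over every node child that dependent k could align to.
                s = sum(math.exp(edge_score(i, k, j, l)) * chart[(k, l)]
                        for l in node_children.get(j, []))
                score *= s
            chart[(i, j)] = score
    return chart


# With all scores zero, each chart entry counts the available alignments.
word_children = {0: [1], 1: [2]}  # go(0) -> hall(1) -> yellow(2)
node_children = {"a": ["h", "w"], "h": ["h"], "w": []}
chart = marginal_chart([2, 1, 0], word_children, ["a", "h", "w"], node_children,
                       node_score=lambda i, j: 0.0,
                       edge_score=lambda i, k, j, l: 0.0)
```

Each (word, node) cell is filled once and touches only a bounded number of child cells, which is where the $O(|x||y|)$ running time under bounded out-degree comes from.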


In our experiments, $\theta$ is optimized using L-BFGS (Liu and Nocedal, 1989). Calculation of the gradient with respect to $\theta$ requires computation of a normalizing constant involving the sum over $p(x, y', a)$ for all $y'$. While in principle the normalizing constant can be computed using the forward algorithm, in practice the state spaces under consideration are so large that even this is intractable. Thus we make an additional approximation, constructing a set $\hat{Y}$ of alternative actions and taking

$$p(y, a \mid x) \approx \prod_{j=1}^{n} \frac{\exp\big\{\psi(y_j) + \sum_{i=1}^{m} 1[a_i = j]\, \psi(x_i, y_j)\big\}}{\sum_{\hat{y} \in \hat{Y}} \exp\big\{\psi(\hat{y}) + \sum_{i=1}^{m} 1[a_i = j]\, \psi(x_i, \hat{y})\big\}}$$

$\hat{Y}$ is constructed by sampling alternative actions from the environment model. Meanwhile, maximization of $a$ can be performed exactly using the Viterbi algorithm, without computation of normalizers.
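The locally normalized approximation can be illustrated with a small sketch that scores each demonstrated action against a set of sampled alternatives. The function names are ours, and we assume that each set of alternatives includes the demonstrated action itself and that the per-step scores (action potential plus any aligned pair potentials) have already been computed.

```python
import math
from typing import Sequence


def logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def approx_log_likelihood(step_scores: Sequence[float],
                          alt_scores: Sequence[Sequence[float]]) -> float:
    """Sum over steps of (score of the demonstrated action) minus
    (log normalizer over that step's sampled alternatives)."""
    if len(step_scores) != len(alt_scores):
        raise ValueError("need one set of alternatives per step")
    return sum(s - logsumexp(alts) for s, alts in zip(step_scores, alt_scores))


# Two steps: the first is a coin flip, the second strongly favors the demonstration.
ll = approx_log_likelihood([0.0, 3.0], [[0.0, 0.0], [3.0, 0.0, 0.0]])
```

Because the normalizer is a sum over a handful of sampled actions per step, the gradient of this quantity is cheap to compute and can be handed to a generic optimizer.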

Inference at test time involves a slightly different pair of optimization problems. We again perform iterated conditional modes, here on the alignments $a$ and the unknown output path $y$. Maximization of $a$ is accomplished with the Viterbi algorithm, exactly as before; maximization of $y$ also uses the Viterbi algorithm, or a beam search when this is computationally infeasible. If bounds on path length are known, it is straightforward to adapt these dynamic programs to efficiently consider paths of all lengths.
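For a fixed path, the maximization over monotone alignments can be sketched as a Viterbi-style dynamic program over a table of pair scores. This is an assumed illustration: `pair_scores[i][j]` stands for the score of aligning instruction `i` to action `j`, indices are 0-based, and path-only terms are omitted because they do not depend on the alignment.

```python
from typing import List, Sequence, Tuple


def best_monotone_alignment(pair_scores: Sequence[Sequence[float]]
                            ) -> Tuple[List[int], float]:
    """Highest-scoring alignment with non-decreasing action indices."""
    m, n = len(pair_scores), len(pair_scores[0])
    best = [list(pair_scores[0])]   # best[i][j]: best score ending with a_i = j
    back: List[List[int]] = []      # back[i - 1][j]: best a_(i-1) given a_i = j
    for i in range(1, m):
        row, ptr = [], []
        arg = 0                     # running argmax of best[i - 1][0..j]
        for j in range(n):
            if best[i - 1][j] > best[i - 1][arg]:
                arg = j
            row.append(pair_scores[i][j] + best[i - 1][arg])
            ptr.append(arg)
        best.append(row)
        back.append(ptr)
    # Trace back from the best final cell.
    j = max(range(n), key=lambda c: best[-1][c])
    alignment = [j]
    for ptr in reversed(back):
        j = ptr[j]
        alignment.append(j)
    alignment.reverse()
    return alignment, max(best[-1])


# Instruction 0 prefers action 1 and instruction 1 prefers action 0, but that
# pairing is not monotone, so the search settles on the best legal alignment.
alignment, score = best_monotone_alignment([[0.0, 1.0, 0.0],
                                            [3.0, 0.0, 1.0]])
```

The running argmax makes each row linear in the number of actions, so the whole search costs time proportional to the number of instructions times the number of actions.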

6 Evaluation 



                               P      R      F1
Vogel and Jurafsky (2010)      0.46   0.51   0.48
Andreas and Klein (2014)       0.43   0.51   0.45
Model [no planning]            0.44   0.46   0.45
Model [no grounding graphs]    0.52   0.52   0.52
Model [full]                   0.51   0.60   0.55

Table 1: Evaluation results for the map-reading task. P is precision, R is recall and F1 is F-measure. Scores are calculated with respect to transitions between landmarks appearing in the reference path (for details see Vogel and Jurafsky (2010)). We use the same train / test split. Some variant of our model achieves the best published results on all three metrics.


As one of the main advantages of this approach is its generality, we evaluate on several different benchmark tasks for instruction following. These exhibit great diversity in both environment structure and language use. We compare our full system to recent state-of-the-art approaches to each task. In the introduction, we highlighted two core aspects of our approach to semantics: compositionality (by way of grounding graphs and structure-to-structure alignments) and planning (by way of inference with lookahead and sequence-to-sequence alignments). To evaluate these, we additionally present a pair of ablation experiments: no grounding graphs (an agent with an unstructured representation of environment state), and no planning (a reflex agent with no lookahead).

Map reading. Our first application is the map navigation task established by Vogel and Jurafsky (2010), based on data collected for a psychological experiment by Anderson et al. (1991) (Figure 1a). Each training datum consists of a map with a designated starting position, and a collection of landmarks, each labeled with a spatial coordinate and a string name. Names are not always unique, and landmarks in the test set are never observed during training. This map is accompanied by a set of instructions specifying a path from the starting position to some (unlabeled) destination point. These instruction sets are informal and redundant, involving as many as a hundred utterances. They are transcribed from spoken text, so grammatical errors, disfluencies, etc. are common. This is a prime example of a domain that does not lend itself to logical representation: grammars may be too rigid, and previously-unseen landmarks and real-valued positions are handled more easily with feature machinery than predicate logic.

Feature                     Weight
word=top ∧ side=North        1.31
word=top ∧ side=South        0.61
word=top ∧ side=East        -0.93
dist=0                       4.51
dist=1                       2.78
dist=4                       1.54

Table 2: Learned feature values. The model learns that the word top often instructs the navigator to position itself above a landmark, occasionally to position itself below a landmark, but rarely to the side. The bottom portion of the table shows learned text-independent constraints: given a choice, near destinations are preferred to far ones (so shorter paths are preferred overall).


The map task was previously studied by Vogel and Jurafsky (2010), who implemented SARSA with a simple set of features. By combining these features with our alignment model and search procedure, we achieve state-of-the-art results on this task by a substantial margin (Table 1).

Some learned feature values are shown in Table 2. The model correctly infers cardinal directions (the example shows the preferred side of a destination landmark modified by the word top). Like Vogel and Jurafsky (2010), we see support for both allocentric references (you are on top of the hill) and egocentric references (the hill is on top of you). We can also see pragmatics at work: the model learns useful text-independent constraints, in this case that near destinations should be preferred to far ones.

Maze navigation The next application we con¬ 


sider is the maze navigation task of MacMahon et 
al. (20061 (Figure lb). Here, a virtual agent is sit- 


































Success (%) 


|Kim and Mooney (2012 

1 

57.2 

Chen (20i2 1 


57.3 

Model [no planning] 


58.9 

Model [no grounding graphs] 

51.7 

Model [full] 


59.6 



Match (%) 

Success (%) 

No text 

54 

78 

Branavan ’09 

63 

- 

Model [no planning] 

64 

66 

Model [full] 

70 

86 


Kim and Mooney (2013 i [reranked] 
Artzi et al, (2U14[l' [semi- supervised] 


62.8 

65.3 


Table 3: Evaluation results for the maze navigation task. 
“Success” shows the percentage of actions resulting in a cor¬ 
rect position and orientation after observing a single instruc¬ 
tion. We use the leave-one-map-out evaluation employed by 
previous work|^ All systems are trained on full action se¬ 
quences. Our model outperforms several task-specific base¬ 
lines, as well as a baseline with path structure but no action 
structure. 


uated in a maze (whose hallways are distinguished 
with various wallpapers, carpets, and the presence 
of a small set of standard objects), and again given 
instructions for getting from one point to another. 
This task has been the subject of focused attention 
in semantic parsing for several years, resulting in 
a variety of sophisticated approaches. 

Despite superficial similarity to the previous 
navigation task, the language and plans required 
for this task are quite different. The proportion of 
instructions to actions is much higher (so redun¬ 
dancy much lower), and the interpretation of lan¬ 
guage is highly compositional. 

As can be seen in Table 3[ we outperform a 
number of systems purpose-built for this naviga¬ 
tion task. We also outperform both variants of 
our system, most conspicuously the variant with¬ 
out grounding graphs. This highlights the impor¬ 
tance of compositional structure. Recent work by 
Kim and Mooney (2013| l and Artzi et al. (2014| ) 


has achieved better results; these systems make 
use of techniques and resources (respectively, dis¬ 
criminative reranking and a seed lexicon of hand- 
annotated logical forms) that are largely orthogo¬ 
nal to the ones used here, and might be applied to 
improve our own results as well. 


Table 4: Results for the puzzle solving task. “Match” shows
the percentage of predicted action sequences that exactly
match the annotation. “Success” shows the percentage of
predicted action sequences that result in a winning game configuration,
regardless of the action sequence performed. Following
Branavan et al. (2009), we average across five random
train / test folds. Our model achieves state-of-the-art results
on this task.
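The two metrics defined in the caption can be made concrete with a short sketch. The helper names (`match_rate`, `success_rate`, `average_over_folds`), the toy win condition, and the per-fold numbers are assumptions made for illustration only; they are not the scoring code or the results behind Table 4.

```python
from statistics import mean
from typing import Callable, List, Sequence

Plan = Sequence[str]


def match_rate(predicted: List[Plan], annotated: List[Plan]) -> float:
    """Percentage of predicted plans identical to the annotated plan."""
    hits = sum(list(p) == list(a) for p, a in zip(predicted, annotated))
    return 100.0 * hits / len(predicted)


def success_rate(predicted: List[Plan], is_winning: Callable[[Plan], bool]) -> float:
    """Percentage of predicted plans that reach a winning configuration,
    regardless of which actions were used to get there."""
    hits = sum(is_winning(p) for p in predicted)
    return 100.0 * hits / len(predicted)


def average_over_folds(per_fold_scores: List[float]) -> float:
    """Average a metric across random train / test folds."""
    return mean(per_fold_scores)


# Hypothetical toy puzzle: a plan wins if it removes blocks a, b and c,
# in any order.
def toy_is_winning(plan: Plan) -> bool:
    return set(plan) == {"a", "b", "c"}


predicted = [["a", "b", "c"], ["c", "b", "a"], ["a", "b"], ["a", "b", "c"]]
annotated = [["a", "b", "c"]] * 4

match = match_rate(predicted, annotated)
success = success_rate(predicted, toy_is_winning)
# Placeholder per-fold scores, not results from the paper.
averaged = average_over_folds([50.0, 75.0])
```

In this toy case the second prediction counts toward “Success” but not toward “Match”, which is exactly the gap between the two columns that the caption describes.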


Puzzle solving The last task we consider is the
Crossblock task studied by Branavan et al. (2009)
(Figure 1c). Here, again, natural language is used
to specify a sequence of actions, in this case the
solution to a simple game. The environment is
simple enough to be captured with a flat feature
representation, so there is no distinction between
the full model and the variant without grounding
graphs.

Unlike the other tasks we consider, Crossblock
is distinguished by a challenging associated search
problem. Here it is nontrivial to find any sequence
that eliminates all the blocks (the goal of the puzzle).
Thus this example allows us to measure the effectiveness
of our search procedure.

Results are shown in Table 4. As can be seen,
our model achieves state-of-the-art performance
on this task when attempting to match the human-specified
plan exactly. If we are purely concerned
with task completion (i.e. solving the puzzle, perhaps
not with the exact set of moves specified
in the instructions) we can measure this directly.
Here, too, we substantially outperform a no-text
baseline. Thus it can be seen that text induces a
useful heuristic, allowing the model to solve a considerable
fraction of problem instances not solved
by naive beam search.

The problem of inducing planning heuristics
from side information like text is an important
one in its own right, and future work might focus
specifically on coupling our system with a more
sophisticated planner. Even at present, the results
in this section demonstrate the importance of
lookahead and high-level reasoning in instruction
following.


^We specifically targeted the single-sentence version of 
this evaluation, as an alternative full-sequence evaluation 
does not align precisely with our data condition. 


7 Conclusion 

We have described a new alignment-based compositional
model for following sequences of natural
language instructions, and demonstrated the
effectiveness of this model on a variety of tasks. A
fully general solution to the problem of contextual
interpretation must address a wide range of well-studied
problems, but the work we have described
here provides modular interfaces for the study of
a number of fundamental linguistic issues from a
machine learning perspective. These include:


Pragmatics How do we respond to presupposition
failures, and choose among possible
interpretations of an instruction disambiguated
only by context? The sequence-prediction architecture
we have described provides a simple answer to this
question, and our experimental results demonstrate that
the learned pragmatics aid interpretation of instructions
in a number of concrete ways: ambiguous
references are resolved by proximity in
the map reading task, missing steps are inferred
from an environment model in the maze navigation
task, and vague hints are turned into real plans
by knowledge of the rules in Crossblock. A more
comprehensive solution might explicitly describe
the process by which instruction-givers’ own beliefs
(expressed as distributions over sequences)
give rise to instructions.


Compositional semantics The graph alignment
model of semantics presented here is an expressive
and computationally efficient generalization
of classical logical techniques to accommodate environments
like the map task, or those explored
in our previous work (Andreas and Klein, 2014).
More broadly, our model provides a compositional
approach to semantics that does not require an
explicit formal language for encoding sentence
meaning. Future work might extend this approach
to tasks like question answering, where logic-based
approaches have been successful.


Our primary goal in this paper has been to explore
methods for integrating compositional semantics
and the pragmatic context provided by sequential
structures. While there is a great deal
of work left to do, we find it encouraging that
this general approach results in substantial gains
across multiple tasks and contexts.